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Partial  Likelihood  Analysis  of  Time  Series  Models,  with 
Application  to  Rainfall-Runoff  Data 

^  i  i 

by  Eric  Slud  and  Benjamin  Kedem 
Mathematics  Department 
University  of  Maryland 
College  Park  ,  MD  20742 

Abstract:  A  general  logistic-autoregressive  model  for  binary  time 
series  or  longitudinal  responses  is  presented,  generalizing  the  discrete¬ 
time  Cox  (1972)  model  with  time-dependent  covariates  as  well  as  the 
recent  regression  models  of  Kaufmann  (1987)  for  categorical  time- 
series.  Since  this  model  is  formulated  in  terms  of  time-series 
covariates  which  are  not  themselves  explicitly  modelled,  the  large- 
sample  theory  of  parameter-estimation  must  be  justified  by  means  of 
Partial  Likelihood  in  the  sense  of  Cox  (1975),  using  theoretical  results 
like  those  of  Wong  (1986).  The  large-sample  theory  also  justifies 
goodness  of  fit  tests  analogous  to  the  chi-squared  tests  of  Schoenfeld 
(1980)  and  to  the  tests  based  on  sums  of  (normalized)  squared 
residuals  used  in  logistic  regression.  These  ideas  are  illustrated  by 
analysis  of  a  rainfall-runoff  hydrological  dataset  previously  analyzed 
by  Yakowitz  (1987). 

1  Research  supported  by  the  Office  of  Naval  Research  under  Contract 
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PARTIAL  LIKELIHOOD  ANALYSIS  OF  LOGISTIC-AUTOREGRESSIVE 
AND  LONGITUDINAL  LOGISTIC  MODELS,  WITH  APPLICATION  TO 

RAINFALL-RUNOFF  DATA 

1  i 

by  Eric  V.  Slud  and  Benjamin  Kedem 
Mathematics  Department,  University  of  Maryland,  College  Park 

1.  Intrcxiuction.  A  generic  description  of  a  great  many  datasets 
collected  over  time  is  the  following:  (i)  a  univariate  time  series 

[or  a  set  of  independent  realizations  X1^  of  such  a  series]  is  of 

primary  interest;  (ii)  auxiliary  vectors  Z^  [  or  in  the  case  of 
multiple  realizations  ]  are  gathered  for  their  value  in  explaining  the 
behavior  of  X in  terms  of  information  observable  up  to  time  t-1  ; 
but  (iii)  no  probabilistic  models  are  available  concerning  the 
stochastic  evolution  over  time  of  the  explanatory  variables  Z^  , 
because  the  behavior  of  (ZJ  either  is  too  difficult  to  model  effectively 
or  is  not  of  direct  scientific  interest  to  justify  restrictive  assumptions. 
The  scientific  problem  is  then  to  understand  the  mechanism  by  which 
values  Xt  ,  t=l,...,T  ,  arise  from  the  antecedent  configuration  as  of 

time  t-1  .  The  practical  problem  in  hand  may  be  to  predict  ,  forecast  , 
or  classify  future  values  X^  from  data  available  up  to  t-1  ;  alterna¬ 
tively,  one  may  wish  only  to  estimate  or  test  hypotheses  about  certain 
parameters  describing  the  strength  of  dependence  of  X^  on  antecedent 
variables  Z^  ^  .  In  either  setting,  the  problem  can  be  formulated  as 
one  of  Inference  concerning  time-invariant  parameters  0  of  the 
conditional  law  of  X^  given  the  observable  data  j  as  of  time  t-1  . 
Thus  it  may  make  sense  to  perform  inferences  concerning  parameters  0 
describing  a  constant  causal  structure,  even  when  the  "inputs"  Z  , 
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from  which  Xt  arises  may  be  nonstationary.  In  very  different  contexts, 
this  idea  is  familiar  both  to  econometricians  ( cf .  the  discussion  of  Wold 
1959  concerning  exogenous  variables )  and  to  biostatisticians  (e.g.,  in 
the  Cox  1972  model  with  time- dependent  covariates) .  One  of  the  main 
purposes  of  the  present  paper  is  to  argue  that  the  same  formulation 
deserves  a  more  prominent  place  in  time-series  analysis. 

The  class  of  models  which  we  consider  in  this  paper  specializes  the 

previous  discussion  to  the  case  where  the  series  or  X1^  are 
binary ,  i.e.,  take  only  the  values  1  or  0  ,  and  where  the  conditional 

laws  of  Xj.  or  (  X1^  }  given  F  [the  cr-field  generated  by  all 

variables  X  and  Z  ,  or  X1  and  Z1  ,  for  O^s^t-1  ]  are  given 
s  s  s  s 

by 

-0Z,1  ,  -i 

p\(0)  =  Pp{  X't=l  |  Ft  l,  (XJt  ,  j*i)  }  =  {  1  +  e  W}  (*) 

where  the  real  vector  parameter  has  the  same  dimension  d  as  the 

covariate  vectors  Z1^  .  Here  all  vectors  are  taken  to  be  column- 
vectors,  and  •  denotes  inner-product.  Informally,  the  model  (*)  says 

that  the  variables  X1^  (  for  i=l,...,n  and  t=l,...,T  )  are  conditionally 
independent  for  different  i  given  F^  ,  and  satisfy  a  logistic 

regression  model  with  respect  to  the  covariates  Z1^  .  For  notational 

i  i 

convenience,  we  take  X^  =  X  ^  and  Z^Z  (  if  n-l  ;  and  we  take  the 

first  coordinate  of  the  covariate  vectors  Z^  to  be  1  ,  so  that  the 
intercept-coefficient  is  . 

In  the  model  (*),  t  is  always  a  discrete  tirne-variable;  the 


independent  time-series  replications  indexed  by  i  will  ordinarily 
correspond  to  independent  individuals  under  study;  the  dichotomous 

response-variable  X1^  will  often  indicate  that  the  Pth  individual  under 

study  failed  at  time  t  (on  test).  In  cases  where  the  number  n  of 
replicates  is  1  ,  may  indicate  that  Z  lies  inside  a  certain  subset 

of  R1^  ,  e.g.  that  a  specified  function  Vt  =  <p( Z^)  is  greater  than  0  . 

Finally,  the  vector  Z1^  of  covariates  for  the  Pth  replication  will  be 
observable  by  time  t  but  may  of  course  include  past  values  (before  t) 
of  time-series  observables  or  responses. 

The  logistic  form  (*)  for  the  conditional  probability  that  X1^.  =  1 
is  a  model  assumption  which  may  not  always  be  appropriate.  However, 
there  is  a  simple  argument  deriving  from  statistical  mechanics  which 
helps  to  justify  logistic  regression  but  does  not  seem  to  have  been  given 
before  by  statisticians.  The  proof  of  the  following  assertion  is  a 
straightforward  exercise  on  Lagrange  multipliers. 

Lemma  1  .i .  Suppose  that  the  random  variable  X  takes  values  0,1 
and  that  p.  =  P{  X=i  }  is  restricted  in  terms  of  the  constants 
("covariates")  Y(  *  Y0  to  satisfy  the  constraint  :  pi  Yt  +  p0  Y0  =  c 
for  a  given  constant  c  between  Y0  and  Yt.  The  probability 
p  =  pi  a  P(X=1)  which  maximizes  the  entropy  p  In  p  -  (1-p)  ln(l-p)  , 
is  given  by  P(X-l)  =  l/{  1  +  exp[  y  (Y|-Y0)]  }  ,  where  the  constant 
y  is  uniquely  determined  by  the  constraint. 

Z  Examples.  How  can  time-series  models  of  the  type  (*)  arise  ? 

First,  if  the  covariates  Z1^  for  all  t  and  i  can  be  regarded  either  as 
design-parameters,  or  more  generally,  as  known  at  time  0  ,  then 


conditionally  given  the  entire  set  of  covariates,  the  responses  X1^ 

follow  an  ordinary  logistic  regression  model.  However,  the  models  (*) 
are  much  more  interesting  and  subtle  when  the  covariate-vectors  include 

information  about  past  values  X1  for  s^t-i  or  about  values  Y1  of 

s  s 

time-series  from  which  X1  is  derived. 

s 


2. A.  Linear-  and  Logistic-  Autoregressive  Models 

Consider  the  case  of  a  single  time-series  Y  in  terms  of  which  the 

1 

indicator-series  X^  =  X  ^  is  defined.  For  example,  we  may  wish  to 
analyze  the  level-crossing  behavior  of  a  continuous -valued  time  series 
Yj.  ,  in  which  case  we  might  take  X^  =  Ijy  ^  for  some  constant  r  . 

The  model  (*)  can  arise  in  this  context  in  a  very  simple  way.  Suppose 
that  Y^  were  an  Autoregressive  time-series  of  order  p  ,  i.e.,  that 

V<Vro-y.  Yt_i  '  '  *p  Yt-p>  7  x 

X  X  2 

is  independent  of  {  Y  :  s^t-1  }  with  the  Logistic  density  e  /(1+e  )  . 

s 

Then  it  is  easy  to  check  for  each  fixed  r  e  R  that  X^  =  Ijy  ^rj 
satisfies  (*)  with  ^  t-1 t-p^  7  ’  ^~P+*  »  anc^ 

P  =  (To  r,yi,...,y_  )  '/\  .  In  other  words,  the  AR(p)  model  with  logistic 

P 

errors  for  Y^  implies  (*)  for  X^  =  Ijy  >rj  for  all  r  .  Conversely,  if 


(*)  holds  for 


Ypr] 


for  each  r 


with  the  corresponding  coefficients 


differing  only  through  -  (y0-r)/X  ,  then  Y^  is  AR(p)  with 
logistic  errors.  A  worthwhile  extension  of  Tie  foregoing  is  to  take 


4 


€j.  =  (Y^-  (3  Z  t)/A  independent  of  {  Y  ,  U  :  s^t-i  }  where  Z  = 

(ZAU^/)7  ,  with  Z^  defined  as  above  and  an  auxiliary  covariate 

series  observable  at  the  same  time  as  .  We  do  not  know  any  models 
essentially  different  from  these  for  continuous  series  Y^  ,  in  which 


I[Y  >r]  sat^s^es  (*)  f°r  more  than  one  value  of  r. 


The  model  (*)  for  Ij-y  >rj  for  a  fixed  (known)  r  is  a  much 


weaker  restriction  on  Y^  and  Z^  than  the  assumption  that  Y^  is 

AR(p)  with  symmetric  errors,  since  it  leaves  the  conditional  law  of 
Y^  -  r  given  s^t-1}  and  >rj  completely  unspecified.  In  this 

way,  it  is  more  general  than  the  categorical  time-series  models  of 
Kaufmann  (1987)  .  In  Section  5,  we  describe  a  dataset  where  a  long 
continuous-valued  series  Y^.  does  seem  to  obey  a  model  (*)  for  I^y  >rj 

with  some  fixed  values  of  r  ,  but  does  not  seem  to  fit  an  AR(p)  model. 


2.B.  Discrete-time  Cox  models  with  time-dependent  covariates  and 
multiple  event-times.  General  models  of  the  type  (*)  have  appeared 
before  in  the  context  of  Survival  Analysis  [  Cox  1975;  Andersen  and 

Gill  1982;  Arjas  and  Haara  1987  ]  .  In  this  setting,  is  the 

indicator  of  the  event  that  the  i’th  individual  under  study  fails  at  time  t 
[  or,  as  in  the  more  usual  continuous-time  context,  has  failed  before 

time  t  ]  .  The  vector  Z1^  of  time- dependent  covariates  contains 

all  the  relevant  prognostic  information  for  failure  either  at  time  t+1 
or  in  the  period  (t,t+l]  .  What  is  worth  emphasizing  here  is  that 


models  (*)  can  also  accomodate  "failures"  that  may  happen  more  than 
once  to  the  same  individual,  such  as  recurrences  of  some  disease-state 
like  a  clinically  detectable  tumor  .  (See  Gail,  Santner,  and  Brown  1980 
for  the  most  notable  previous  effort  to  model  multiple  times  to  tumor). 

For  such  applications,  Z1^  may  have  the  same  medical  measurement 
occurring  in  more  than  one  component  according  to  how  often  the 
individual  has  previously  "failed",  i.e.,  according  to  the  number  of 

values  s=i,...,t  for  which  X1  =1  .  The  purpose  of  such  a  definition  is 

to  allow  different  coefficients  fi  to  operate  when  previous  failures 
have  occurred. 

Other  applications  of  models  like  (*)  in  the  case  of  relatively  few 
long  data-records  have  been  given  by  Pons  and  de  Turckheim  (1985)  . 


For  them,  X1^  indicates  that  the  Pth  animal  (rabbit)  under  study  has 
eaten  something  at  time  t  ,  and  Z1^  includes  information  about  times 


and  amounts  of  previous  feedings.  Their  method  of  inference  explicitly 
allows  for  an  additional  undetermined  (periodic)  nuisance  factor  h0(t) 
not  depending  on  p  in  the  expression  for  the  conditional  probability 
pt(0  in  (*). 


3.  Partial  Likelihood.  Regularity  conditions  for  large-sample  theory. 

The  remarkable  thing  about  model  (*)  is  that  under  mild  regularity- 
conditions  on  the  large-sample  behavior  of  the  covariate-processes 

(Z1^:  i=l,...,n,  t=0,...,T)  as  n'T  — ►  oo  ,  there  is  a  numerically  stable 

estimator  of  p  which  is  consistent  and  asymptotically  normal  with 


easily-estimated  variance-covariance  matrix.  The  theoretical  apparatus 
used  to  justify  these  statements  is  the  Partial  Likelihood  of  Cox (1975) 
as  developed  by  Wong  (1986)  and  expounded  by  Slud(1988).  The  notion 
of  Partial  Likelihood  ,  specialized  to  our  setting,  requires  precisely 
what  we  have  so  far  assumed,  namely  a  model  for  each  t  in  terms  of 
the  same  finite-dimensional  parameter  (3  ,  for  the  conditional 

likeliftood  of  the  observable  response  data  (  X1^:  i=l,...,n  }  given  the 

(  a  field  F ^  representing  the  )  collection  of  all  data  observable  before 

time  t  .  The  Partial  L  ikelihood  is  simply  the  product,  over  times  t 
when  observations  are  taken,  of  these  conditional  likelihoods.  In  our 
setting,  the  conditional  likelihoods  are  given  by  (*)  together  with 

1  n 

conditional  independence  of  X  ^  ,...,  X  ,  and  the  partial  likelihood  is 

T  n  x1  1~X^ 

PL(/6t;  E  li  n  [pTfl  1  (1-p \m  '  ] 

t-1  i-1 


The  maximizer  fl?  of  PL(jS!  is  called  the  Maximum  Partial  Likelihood 
Lstimat.or  (MPLE)  of  /?  .  Its  large-sample  behavior  under  (3  -  is 
studied  with  the  aid  of  the  score-statistic  (d  vector)  process 
t  n 


st(p; 


-A 


2  2  x 

i-i  l-i 


(X‘s  ■  p‘s(j?)l  ,  |WRd  ,  t=i . T 


S'  1 


which  at.  t-~T  is  simply  the  gradient  in  fi  of  log  PL[{3)  .  Since  S^[{3) 
is  const ructed  ’o  be  a  martingale  when  f3:  (3Q  ,  the  expression 

up'  2  2  'X  y\‘  p \'p)  I  i  f ',<p) ) 


simultaneously  plays  the  role  of  Hessian  matrix  for  -  log  PL{p)  with 
respect  to  p  (and  is  thus  an  irformation  matrix  for  estimating  p)  , 
as  well  as  the  cumulative  conditional  variance- covariance  for  S ^{p)  at 
Wo  and  t=T  .  The  large-sample  theory  of  MPLE’s,  including  the 
underlying  martingale  theory,  is  described  in  a  general  setting  by 
Wong(1986)  and  Slud(1988,  Chapter  6),  and  in  successively  greater 
and  greater  generality  in  the  setting  of  (*)  by  Andersen  and  Gill 
(1982),  Wong  (1986)  ,  and  Arjas  and  Haara  (1987).  Related 
asymptotic  results  on  MPLE’s  in  (*)  for  large  T  and  fixed  n  were 
obtained  by  Slud  (1984),  Pons  and  de  Turckheim  (1985),  Wong  (1986)  , 
Arjas  and  Haara  (1987),  and  Slud  (1987).  Because  the  ideas  underlying 
Consistency  and  Asymptotic  Normality  of  MPLE’s  are  very  similar  in 
all  the  papers  cited,  based  on  the  martingale  Central  Limit  Theorem  for 

(r.T)^  Sp(^o)  and  stability  in  probability  of  J(0)/(nT)  ,  we  do  not  give 
any  proofs.  We  content  ourselves  with  some  general  comments,  a 
precise  set  of  regularity  conditions,  and  statements  of  results  with 
some  small  extensions  which  are  helpful  in  time-series  applications. 

Since  PL(0)  is  almost  surely  a  cc  ive  random  function  of  pe R^, 
with  strict  concavity  under  an  assumption  of  asymptotic  nonsingularity 
for  l{p)  ,  there  is  no  need  to  treat  the  possibility  of  multiple  MPLE’s 
for  large  nT  .  Moreover,  concavity  removes  the  need  to  establish 
uniform  convergence  [either  a. s.  or  in-probability]  of  l{p)/{ nT)  over 
compact  neighborhoods  of  /?’ s  when  the  corresponding  pointwise 
convergence  to  a  continuous  concave  limit  is  known  (Andersen  and  Gill 
1982,  Appendix  II)  .  Our  regularity  conditions,  chosen  for  simplicity 
rather  than  utmost  generality,  are  : 


B 


(C.l)  The  covariate-vectors  almost  surely  lie  in  a  nonrandom 

compact  subset  T  of  ,  and  the  probability  measure  P  governing 
{  :  i=l,...,n  ,  t=0,...,T  }  obeys  (*)  with  /?=/?0  . 


(C.2)  There  is  a  probability  measure  v  on  Ra  for  which 

/  z  7!  j/( dz)  is  positive-definite,  such  that  under  (*)  with  > 

1  T  n  p  j 

— 2  2  I  — »i/(A)  for  all  Borel  sets  A  C  Ra  as  nT->oo 

nl  t=l  i=l  [Z^eA] 


Under  assumption  (C.2),  which  says  that  the  empirical  measure  of  the 

set  {  Z1  :  0^s<T,  i^i^n  }  converges  in  a  certain  sense  to  a  nonrandom 
s 

measure  v  ,  it  follows  that  for  every  continuous  function  g  from  R^ 
to  R  (which  is  necessarily  bounded  on  the  compact  support  T  of  Zj  ) 
-1  T  n  .  p  f 

(nT)  2  2  g(Z/t  )  - ►  j  g(z)  j/(dz)  as  n— >00 

t=i  i=i  1-1  JRa 

From  this  convergence,  it  is  easy  to  see  that  the  nonrandom  continuous 

matrix-valued  function  A(b)  defined  for  each  b  e  R^  as  the  limit  in 
probability  of  /(b) / (nT)  is  given  by 

r  b-Z 

A(b)  =  — - — r— rr  (  z  z'  )  v(dz)  (3.1) 

jRd  (1  +  e°  z) 

The  matrix  A(b),  which  is  positive  semi-definite  by  inspection  for 
every  b  ,  is  also  nonsingular  at  every  b  by  the  hypothesis  on  u  .  . 
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Theorem  A.  Under  assumptions  (C.  i)-(C.2)  ,  the  MPLE  ff*  is  almost 
surely  unique  for  all  sufficiently  large  nT  ,  is  consistent  in  probability 

for  (3q  as  nT  ->00 ,  and  satisfies  (nT)  i  (Pp-/W  -^N|0,AW'V 

In  addition,  (nT)  *  (0P-0o)  -  (nT)"*  A(/S0)  ‘  ST(/?„)  — K)  as  nT->co  . 

/V 

The  large-sample  behavior  of  (3 P  described  here  is  based  on  the 
assumed  stability  of  information  /(/?)/ (n T)  per  observation  .  The  only 
real  novelty  in  our  presentation  is  that  a  single  theoretical  result  is 
made  to  encompass  both  the  cases  where  n  becomes  large  and  those 
where  T  does.  Other  possible  limiting  distributional  behavior  is 
possible  in  so-called  nonergodic  cases  (Basawa  and  Scott  1983)  where 
normalized  information  converges  in  probability  to  a  random  limit  . 

The  reason  for  referring  to  (C.2)  as  an  ergodic  setting  is  that  it  is 
follows  from  the  Birkhoff  Ergodic  Theorem  either  in  the  case  of  indepen¬ 
dent  identically  distributed  processes  (Z1^:  O^t^T)  for  i=l,...,n  with 

T  fixed  ,  or  of  an  ergodic  and  stationary  process  (Z  ^...jZ1^)  in 
t=0,i,...  with  n  fixed. 

In  attempts  to  fit  (*)  to  data,  it  is  crucially  important  to  be  able  to 
assess  the  quality  of  fit  of  the  model  with  parameters  f3  in  terms  of 

the  residuals  X1^  -  p1^ (/9)  .  These  are  precisely  the  residuals  which  one 

uses  in  ordinary  logistic  regression,  and  two  sorts  of  goodness-of-fit 
statistics  based  on  them  are  of  particular  value.  The  idea  of  the  first  is 

to  check  the  observed  number  of  responses  X1^!  corresponding  to 


Z1^  |  within  each  of  a  number  of  covariate-defined  cells  against  the 


cumulative  one-step-ahead  predicted  number.  The  resulting  chi-squared 
statistic  is  closely  related  to  the  one  given  in  a  Survival  context  by 
Schoenfeld  (1980).  Our  second  type  of  statistic  is  a  sum  of  squared 
normalized  residuals.  The  theory  underlying  both  statistics  is  c  ntained 
in  the  following  two  theorems,  which  follow  readily  from  Theorem  A 
and  several  applications  of  the  Martingale  (multivariate)  Central  Limit 
Theorem  (as  given  for  example  in  Andersen  and  Gill  1982,  Appendix  I). 

In  both  Theorems,  it  is  crucial  that  the  set  of  possible  p1^)  values 
for  all  i  and  t  and  all  0  in  a  neighborhood  of  lie  in  a  nonrandom 
compact  subset  of  (0,1)  .  This  fact  follows  from  (C.  1)  . 

d 

Theorem  B.  Let  Ct  ,  ...  ,  Cr  be  a  partition  of  Ra  into  measurable 

sets,  and  define  N.  and  E.(/9)  for  j  =  l,..,k  through  the  formulas 

J  J 


n  T 


n  T 


V2  2  i|Z‘  eC  l  x‘t  >  Ei(W  E  2  2  i[zi  eC  i  p\W 

J  i=l  t=i  -  t - 1  j 1  1  J  i=l  t=l  lzu-i  j1 


Put  N  =  (Nt,. . .N^)'  and  E(/?)  =  (E1(^),...,Ej<(^))/  .  Then  for  fixed  k 
(nT)~*  (  (N  -  gmv  ,  (#?p-Po)'  )'  — N(  0  ,  2  ) 

f  A  Br 

where  2  is  a  square  d+k  dimensional  matrix  of  the  form 

Ir  nJ 


lB  Dj 

Here  A  is  kxk  and  diagonal  with  j’th  diagonal  element 

2  r  Jo'*  -i 

a .  =  — - 2  Wdz)  ,  D  =  A(£o)  is  dxd  ,  and  the  j’th 

J  L.  (l+ePo'z) 

J 

r  sPo'z 

column  of  B  is  given  by  - P - 2  Dz  i/(dz)  .  In  addition  , 

>C.  (l+ep°z) 

J 


p 


00  . 


(nT)’*  (  E(/?P)-E(jS0>  )  -  (nT)’*  B'  D_‘  (0p-ft,)  — P->  0  as  nT  — ► 

k  2  2 

It  follows  that  as  nT— Ko,  both  the  statistics  2  (N  . -E  .  (jf?o) )  /o  .  and 

jnl  J  J  J 

(N-E(f)P))'  (A-B' d  'b)"'  (N-E(fiP))  are  asymptotically  ^'distributed. 


Theorem  C.  Under  hypotheses  (C.  i)  and  (C.2)  ,  let  a^O  be  fixed 
arbitrarily,  and  define  v(i,t)  =  p^(fio)  (l~p\(fio))  for  all  i,  t  .  Then 


n  T 

2  2 

i=i  t=i 


v{i,t)a 


-  [v(i,t)] 


1-a 


D 


o  T \-2a  V2 

2:  2  (Wi.t)}1  •{  1  -4v(i,t) } 

1=1  t=l 


-v  N{ 0,1)  as  nT — ►  00. 


The  result  of  Theorem  C  is  equally  valid  when  the  residuals  and 

/\ 

v(i,t)  are  calculated  with  0O  replaced  by  However,  in  applications 

where  predicted  response-probabilities  p^(^o)  can  get  very  close  to  0 

or  1  ,  the  behavior  of  the  normal  deviates  in  Theorem  C  may  not  be 
very  reliable  . 

4.  Asymptotic-efficiency  calculations.  Extensions  to  other  link-functions. 

How  efficient  is  a  data  analysis  based  on  (*)  and  the  Partial  Likelihood 

when  a  fully  specified  model  for  (X^Z1^)  is  available  ?  While  no 
general  answer  to  this  question  is  possible,  we  can  calculate  the 


asymptotic  relative  efficiency  in  the  AR(p)  Example  of  §  2. A  ,  where 
for  a  fixed  r  ^  0  , 

Xt  =  X  t  =  I[Yt^r]’  ^t-l  =  Z  t-1  =  (1,Yt-r"',Yt-p) 

P  =  ((ro-D/X,  r,/A,  ....  Yl/X),)  £t  =  (  Yt  -  l  )  /  X 

and  is  assumed  to  be  independent  of  F^  =  a(  Zs:  s<t  )  and  to  have 

X  X  2 

the  logistic  density  f(x)  =e  /(1+e  )  .  Taking  the  initial  data  Z0  to  be 
given,  one  can  easily  derive  the  likelihood  Lift)  for  the  observed  data 

{  Yt  as  well  as  the  limiting  information  about  the  parameter 

vector  (3  per  observation.  This  information-matrix,  equal  to  the 
inverse  of  the  asymptotic  variance  matrix  for  the  maximum-likelihood 
estimator  of  (3  when  the  true  parameter  value  is  pQ  ,  is  given  by 

m  =  £fl{-~r5  zz'l  =  4£fi( zz') 

P°  (  l+ee)  P° 

where  ep*  denotes  expectation  when  the  random  (p+ 1 ) -vector  Z  has 

the  stationary  distribution  for  Z^  (assumed  to  exist)  as  t— ►oo  ,  and 
e  is  a  logistic  random  variable  independent  of  Z  .  The  corresponding 
"partial  likelihood"  information-matrix  arising  in  Theorem  A  above  was 
derived  in  (3.1)  as 

D  AZ 

lP(P o)  =  A()So)  =  Eft{  0-Zji  z  Z'  }  =  EpJ  f(ft,-Z)  Z  Z'  1 

Since  the  function  f(x)  has  maximum  value  i  ,  it  follows  immediately 
from  the  last  two  formulas  that  for  any  vector  b  e  , 
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t/  /P(w  b  s  3  b'  np0)  b 


Thus,  any  scalar  parameter  derived  from  fi  can  be  estimated  with 
asymptotic  relative  efficiency  (ARE)  at  best  $  via  the  Partial- 
Likelihood  Logistic-Regression  method  as  compared  with  a  maximum- 
likelihood  AR(p)  analysis.  Of  course,  the  ARE  could  (for  some  AR(p) 
models)  be  much  worse  than  $  ,  but  it  should  be  somewhat  reassuring 
that  the  cases  where  relative  efficiency  is  worst  are  the  cases  where  the 
predictors  p^(fi0)  spend  most  of  their  time  close  to  1  or  0  ,  i.e.  the 
cases  where  prediction  is  very  good  ! 


The  foregoing  analysis  of  model  (*)  has  many  possible  generaliza¬ 
tions.  One  that  is  especially  worth  pursuing  is  to  allow  the  predictors 

in  model  (*)  to  take  some  other  parametrically  specified  form 

F(fi-Z1l _^)  as  a  function  of  a  linear  expression  in  the  explanatory 

variables  Z1^  .  One  choice  for  the  "link"  function  F  ,  which  leads  to 

methods  bearing  the  same  relation  to  probit  regression  as  our  previous 
analysis  does  to  logistic  regression,  is  the  standard  normal  distribution 
function  F  =  <I>  .  The  model  from  which  conditional  likelihoods  are 


constructed  is  now 


V  xr>  I  Li  •  <xjt  •  J*»  I  =  p>  ; 


Virtually  every  aspect  of  our  analysis  of  (*)  has  an  analogue  for  (**). 
The  AR(p)  time-series  examples  of  (**),  with  i  Ijy  ,  will  now 

have  Gaussian  error-distribution.  The  proofs  of  Theorems  analogous  to 
A,  B,  and  C  are  very  similar  to  those  in  the  logistic  case,  except  for 
some  extra  complication  since  the  log  Partial  Likelihood  will  no  longer 


i 


g 


be  a.s.  concave  for  finite  data-samples.  Finally,  the  ARE  calculations 
of  this  Section  can  be  carried  out  for  the  AR(p)  examples  of  (**)  and 
lead  to  formulas 

=  Ep{  Z  Z' }  ,  =  Ep[  Z  Z'  ^-4)S0:Z) ) 

where  Z  now  has  the  stationary  distribution  of  in  an  AR(p)  model 

with  normal  errors,  and  <j>  is  the  standard  normal  density.  The  highest 
possible  value  for  ARE  in  estimating  scalar  parameters  related  to  fi  is 
now  2/tt  ,  and  again  the  ARE  can  be  much  worse  than  that  in  cases 
where  prediction  is  very  accurate. 

5^  Application  to  Rainfall -runoff  data.  We  now  describe  the  results  of 
applying  model  (*)  to  the  analysis  of  a  hydrological  dataset  which  has 
previously  been  analyzed  by  quite  different  methods.  The  data,  which  is 
described  by  Yakowitz  (1987),  consist  of  daily  measurements  by  the 
National  Weather  Service  of  rainfall  and  runoff  in  the  Bird  Creek  Ohio 
watershed.  Data  were  collected  for  a  13-15  week  period  during  each  of 
the  years  1939-1964.  The  purpose  of  collecting  such  data  seems 
partly  to  have  been  to  develop  models  for  the  prediction  of  flooding. 

Thus  we  regard  daily  runoff  as  the  (continuous-valued)  response 


variable,  with  past  runoff  together  with  current  and  past  values  of  rain¬ 
fall  playing  the  role  of  explanatory  variables.  Since  flooding  is  of 
interest,  it  is  natural  to  try  to  understand  the  relationships  between 
level-exceedances  =  I^y  >rj  and  the  explanatory  variables.  Our  use 

of  this  example  is  not  intended  primarily  to  develop  a  formula  for 
prediction,  but  rather  to  illustrate  how  models  (*)  may  be  estimated 


and  shown  to  pass  tests  of  adequacy  in  a  particular  situation  where  the 
linear  autoregressive  model  turns  out  not  to  be  appropriate. 

For  our  data  analysis,  we  split  the  26  years  of  data  into  a  training 
set  [  the  years  1939-48,  consisting  of  1031  rainfall-runoff  pairs  ] 
and  a  testing  set  [  the  years  1949-64,  consisting  of  1691  observation- 
pairs  ].  Since  the  models  we  contemplated  involved  explanatory 
variables  defined  from  (Y^ *\-2»^t-3*\-4*Rt*Rt-l  ’^t-2’^t-3^  *  our 

covariate-data  for  the  first  four  response-variables  Xt  in  each  year 
were  incomplete.  Deleting  these  observations  led  to  a  complete  dataset 
of  991  observations  for  model-fitting  and  1627  observations  for  test¬ 
ing  model  adequacy.  The  cutoff  threshold  r  used  in  defining  the 
indicator  response  was  chosen  to  be  1  or  3  (cubic  ft/sec.) 

These  levels  respectively  corresponded  to  244  and  56  positive 
responses  [i.e. ,  values  above  r  =1,  3  j  out  of  991  in  the  fitting- 

dataset,  and  to  401  and  87  out  of  1627  in  the  test  data. 

After  some  preliminary  fitting  [via  maximum  partial-liklihood], 
plotting  of  residuals,  and  computation  of  partial  likelihood-ratio 
statistics  for  the  1939-48  data,  we  found  that  the  important  covariates 
in  model  (*)  for  Ijy  ^rj  (r=i  or  3)  were  ,  Y^  ,  , 

Rt-1  ’  Rt*Rt-r  Rt-2’  Rt-l*Rt-2’  Yt-2*  and  Yt-l*Yt-2  *  AlthouSh  the 
mechanics  of  soil-saturation  naturally  suggest  that  interaction-terms 
involving  successive  days’  rainfall  and  runoff  should  play  some  role  in 
explaining  future  runoff,  previous  linear  —  as  opposed  to  logistic  — 
regression  analyses  do  not  seem  to  have  included  such  terms.  As  it 
turned  out,  not  all  the  terms  listed  above  had  significantly  large 
coefficients  for  each  threshold  r  ,  but  no  other  terms  appeared  to  play 


of  a  role.  In  particular,  our  attempts  to  discover  covariates  to  account 
for  year-to-year  differences  in  runoff  —  and  such  differences  do  appear 
to  be  real  —  did  not  produce  new  covariates  worth  including  in  the 
model.  For  interpretability  of  coefficients,  we  continued  to  include 
separate  terms  corresponding  to  each  significant  interaction-term. 

The  values  0?  and  standardized  values  /?.^/[Va of  the 

fitted  coefficients  0  in  model  (*)  are  exhibited  in  Table  1,  for  each  of 
the  cases  =  I^y  ^rj  with  r=l  and  r=3  .  In  each  case,  the  10- 


dimensional  covariate  vector  is 


Zt's 


(5.1) 


^ 1  ,Rt’ Yt- 1  ,Rt*Yt- 1  ’Rt- 1  ,Rt*Rt- 1  ,Rt-2,Rt- 1  *Rt-2,Yt-2’Yt- 1  *Yt-2  ^ 

The  most  important  covariates  are  :  the  intercept  term  ,  the  interaction 
terms  R^.,  ,  R^R^  ,  Y^Y  2  ,  as  well  as  the  obviously 

important  covariates  ,  Y^  ,  and  R^j  and/or  R^  .  The  log- 

likelihoods  for  the  fitted  models  (on  the  991  training-observations) 
were  -142.7  for  r=l  and  -71.5  for  r=3  . 

Motivated  by  the  results  of  Yakowitz  (1987),  we  tried  also  to  fit 

A 

models  in  which  the  linear  residuals  Y  -  19*Z  =  e  for  s=t-l  and 

s  r  s  s 

s=t-2  from  a  preliminary  fit  would  serve  as  additional  covariates 
[i.e.  as  additional  components  of  an  augmented  Z-vector.  ]  From  the 
perspective  of  partial  likelihood-ratio  tests  performed  on  the  training 
data  alone,  the  coefficients  of  these  additional  covariates  were  on  the 
borderline  of  significance  ;  the  maximized  partial  likelihoods  [  for  12- 
dimensional  covariate-vector  vs.  the  previous  10-dimensional  covariate- 


vector  (5.1)  ]  were  respectively  -140.2  for  the  case  r=i  and 
-68.2  for  r=3  .  [Thus  twice  the  log  partial  likelihood  ratios  for  the 
two  cases  are  5.0  and  6.6  ,  which  should  be  compared  with  the  .05 
percentage-point  of  5.99  for  the  X2  -distribution  .  The  validity  of 
partial-likelihood  ratio  tests  in  this  context  follows  from  Theorem  A  by 
standard  arguments  with  the  asymptotic  multivariate-normal 

A 

distribution  of  ft?  .]  The  slight  advantage  of  the  extra  two  covariates 
disappeared  when  we  applied  the  ten-  and  twelve-co variate  models  to  the 
testing-dataset  (the  1627  observations  for  years  1949-1964).  Therefore 
we  confine  our  further  discussion  to  the  fitted  models  (*)  based  on 
covariates  (5.1)  alone  . 


The  previous  attempts  surveyed  by  Yakowitz  (1987)  to  fit  models 
to  rainfall-runoff  data  involved  linear  (auto-)  regressions,  with  or 
without  moving-average  error  terms.  As  we  have  seen  above  in 
Section  2  ,  if  the  error-terms  were  approximately  logistically 
distributed  then  we  should  expect  the  logistic  autoregressive  models  fit 
with  different  thresholds  r  to  share  the  same  important  covariates  and 
coefficients  (other  than  the  intercept-term) .  Even  a  cursory  inspection 
of  the  fitted  coefficients  in  Table  1  shows  that  this  is  not  the  case  for 


our  rainfall-runoff  data.  This  is  indirect  evidence  for  inadequacy  of  a 
linear-regression  model.  Indeed,  after  numerous  linear- regression  fits 
with  different  sets  of  covariates,  we  convinced  ourselves  that  prediction 


[Y  >r]  was  muc^  ^ess  g°°d  linear  than  with  logistic  models. 


To  begin  to  assess  model  adequacy  for  the  logistic  models  (*) 
fitted  to  the  1939-1948  rainfall -runoff  data,  we  calculated  goodness-of- 
fit  statistics  as  described  in  Theorems  B  and  C  of  Section  3  ,  replacing 


the  (unknown)  parameters  Pq  in  those  statistics  by  their  estimators  fP 

2 

based  on  the  1939-1948  data  .  The  k  (=27)  cells  used  to  define  the  X|< 
statistic  of  Theorem  B  were  based  on  arbitrary  cutoffs  0.004  and 

0.008  for  R^.  ,  0.01  and  0.02  for  and  0.5  and  1.0  for 

2 

Y.  <  ,  and  the  variances  a .  of  Theorem  B  were  estimated  from  the 

t-1  j 

data  as  described  in  Table  2  below.  The  goodness-of-fit  statistics  were 

A 

calculated  (using  the  same  estimated  p^  )  both  on  the  training  dataset 

(1939-1948)  and  on  the  testing-dataset  ( 1949-1 964) »  with  illuminating 

2 

results.  The  goodness-of-fit  chi-squared  statistics  Xg  based  on  a 


Table  1.  Coefficients  and  Standardized  Values  for  Model  (*)  fitted  to 
Rainfall-Runoff  Data  for  Years  1939-48  with  covariates  (5.1). 


r= 

1  case 

r=3 

case 

Coefficient 

index 

/\ 

standardized 

/V 

p? 

standardized 

1 

-5.933 

-8.10 

-5.655 

-9.59 

2 

72.49 

1.81 

92.86 

4.48 

3 

3.633 

5.63 

-0.397 

-1.97 

4 

302.2 

3.97 

99.37 

4.39 

5 

14.68 

0.44 

121.46 

4.20 

6 

6985. 

1.59 

-3139 

-2.14 

7 

-59.87 

-2.22 

43.97 

0.94 

8 

11504 

-2.56 

-2422 

-1.88 

9 

0.630 

-0.87 

-0.42 

-0.82 

10 

1.022 

2.04 

0.128 

1.87 
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decomposition  of  the  covariate-space  into  3  =27  cells  should  be 

compared  with  a  quadratic-form  in  multivariate-normal  random 

2 

variables  which  is  stochastically  smaller  than  \21  on  the  training-data 

[due  to  having  estimated  0O  from  the  same  data]  and  stochastically 

2  ~ 
larger  than  on  the  testing-data  [since  the  estimated  fP  is  approxi¬ 
mately  independent  of  the  deviations  N  -  E(0O)  for  the  testing-data]  . 

2 

In  the  notation  of  Theorem  B  ,  the  expectation  of  Xg  °n  the  training- 


data  is  k  -  A  and  its  expectation  on  the  test-data  is  k  +  A  ,  where 
k  -i  2 

A  =  2  (B/D  B)  . .  /  o.  .  In  our  example,  with  k=27>  the  (empirically 

J=1  JJ  J  i?  3.0 

estimated)  value  of  A  turns  to  be  when  r=l  and  when 
r=3  .  The  different  behavior  of  A  for  the  different  thresholds  r 

appears  to  be  due  to  the  very  different  predicted  response-probabilities 

/\ 

pt(0^)  which  arise  in  the  models  for  different  r  with  a  given  covariate- 


covariate-vector  Zt  .  Table  2  indicates  that  according  to  all  three  of 
our  goodness-of-fit  statistics,  the  models  (*)  with  =  Ijy  for 


both  r=l  and  r=3  are  adequate  for  the  training  data  and  that  the 
corresponding  models  with  coefficients  fitted  to  the  training  data  (years 
1939-48)  are  seriously  inadequate  for  the  the  test-data  (years  1949- 
64) .  This  need  not  be  too  distressing,  since  there  do  seem  to  be 
noticeably  different  rainfall-runoff  patterns,  including  differences  in 
overall  runoff  levels,  from  year  to  year,  although  the  differences  are 
not  systematic  enough  to  be  easily  encoded  into  additional  covariates  . 
Some  care  is  needed  in  interpreting  the  results  using  the  statistics 


1  A  2  A  ^ 

2  (X,-Pt(^))  /[  pf ( 1 -p, (^P) )  ]  from  Theorem  C  ,  since  many  of 
t-i  1  1  1  1 

/V 

the  logistic  prediction-probabilities  based  on  either  dataset  are 

quite  close  to  zero  or  one.  For  this  reason  ,  the  results  in  Table  2  for 
Wr  n  are  probably  more  reliable  than  those  for  Wr  ,  . 


Table  2  .  Goodness-of-fit  statistics  on  traini 


Model  (*)  with  coefficients  displayed  in  Table  1.  The  chi -squared 

2  A 
statistic  Xg  is  defined  as  in  Theorem  B  with  ft,  replaced  by  fP  , 

2  a2  T  a  a 

with  o  estimated  by  a.  =  T  2  W  1  Pt  m  ll-p APn)  ,  and 
J  J  t=i  [  t-i  y 

with  27  partition-cells  C.  defined  as  the  intersections  of  all  sets  R.e 

J  1 

[0,.004]  ,  (.004, .008]  ,  or  (.008, oo)  ;  R^+Rj.  e  [0,.01]  ,  (.01, .02], 

or  (.02, oo)  ;  and  e  (0,.5]  ,  (.5,1]  ,  or  (l,co)  .  The  statistics 

Wr  for  a  =  0  and  a=i  are  the  asymptotically  normal  deviates 
u,a 

A 

defined  in  Theorem  C  ,  with  ft,  replaced  by  fP.  In  all  statistics, 

A  A  2 

estimators  /r  and  a  are  calculated  from  training-dataset  (*39-*48). 

J 


Statistic 


Training-data  (991  obs.) 


Test-data  (1627  obs.) 


2 

r=l 

r=3 

r=l 

r=3 

*B 

31.0 

7.47 

98.8 

52.6 

WC,0 

0.47 

-0.03 

7.4 

2.4 

WC,1 

-0.01 

1.14 

21 

44.9 

18.8 

£ 


A  further  interesting  conclusion  from  fitting  models  (*)  to  the 
1939-1948  rainfall-runoff  data  of  Yakowitz  (1987)  is  that  while  these 
models  appear  to  fit  reasonably  well  both  with  the  level -crossing 
thresholds  r=l  and  r=3  ,  they  do  not  in  this  example  appear 
compatible  with  a  single  linear  model  for  runoff.  In  this  sense,  we 
have  a  >~eal  example  of  (*)  different  from  the  examples  of  Section  2. A 
which  derive  from  AR(p)  models  with  logistic  errors. 
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